https://github.com/hbhjj/IODS-project
The aim of this course is to become familiar with open data science, including open data and open source tools.
This week I studied regression analysis with R in the IODS course. Regression analysis is a method for estimating relationships between variables. In this exercise, we will look at how learning strategies and attitude relate to the exam points of one statistics course.
A more detailed description of the dataset can be found at http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS2-meta.txt
Let's read the data from disk
learning2014 <- read.table(file= 'data/learning2014.txt', header=T)
Dimensions and structure of the data
dim(learning2014)
## [1] 166 7
There are 166 observations and 7 variables in the data
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.75 2.88 3.88 3.5 3.75 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
The 7 variables in the dataframe are gender, Age, attitude, deep, stra, surf and Points
Let's check the summary and visualize the data
If they are not already installed, you should run the command install.packages(c("ggplot2", "GGally")) to install the packages required for data visualization
The summaries of the variables
summary(learning2014)
## gender Age attitude deep stra
## F:110 Min. :17.00 Min. :1.400 Min. :1.625 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.500 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.875 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.796 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.250 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.875 Max. :5.000
## surf Points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
Access the GGally and ggplot2 libraries
library(GGally)
library(ggplot2)
draw a scatter plot matrix of the variables in learning2014. [-1] excludes the first column (gender)
pairs(learning2014[-1], col=learning2014$gender)
create a more advanced plot matrix with ggpairs()
p <- ggpairs(learning2014, mapping = aes(col=gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
draw the plot
p
It looks like Points is correlated most heavily with attitude, stra and surf. We will use these variables later in our regression model. Students tend to be fairly young and there are more females than males, which is not that surprising - these are university students, after all. The deep learning variable peaks at the higher scores.
Next, we should build a model with the variables described above.
Let's create a regression model with attitude, stra and surf as explanatory variables and Points as the variable to be explained, and print a summary of it
r_model <- lm(Points ~ attitude + stra + surf, data = learning2014)
summary(r_model)
##
## Call:
## lm(formula = Points ~ attitude + stra + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
stra and surf did not have a statistically significant connection to Points; attitude did. As it is now, around 19% of the variability in Points is explained by the model, according to the adjusted R-squared value. A one-point increase in attitude implies a 3.4-point increase in Points, if one believes the model. The intercept is around 11, so when the explanatory variables are 0, Points should be around 11.
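As a sketch of how the coefficients translate into predictions, using the rounded estimates from the summary above (the helper function and the example student are purely illustrative):

```r
# predicted Points for a hypothetical student, using the rounded
# coefficients from the summary above: intercept 11.02, attitude 3.40,
# stra 0.85, surf -0.59
predict_points <- function(attitude, stra, surf) {
  11.02 + 3.40 * attitude + 0.85 * stra - 0.59 * surf
}
# a student near the sample means: attitude 3.2, stra 3.1, surf 2.8
predict_points(3.2, 3.1, 2.8)  # about 22.9, close to the mean of Points
```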
As a final touch, let's look at a graphical analysis of the results by plotting Residuals vs Fitted values, the Normal Q-Q plot and Residuals vs Leverage
par(mfrow=c(2,2))
plot(r_model, which= c(1,2,5))
The Residuals vs Fitted plot seems fairly random, so the errors should not depend on the explanatory variables. The Q-Q plot suggests that the errors of the model are fairly normally distributed; there is some deviation at both tails, but it does not seem dramatic. The Residuals vs Leverage plot shows some outliers, but their leverage is not drastically high compared to the other data points.
This week, we will explore student alcohol usage. The dataset has, for example, questions related to alcohol usage, social relationships, health and studying. Description of the data can be found here https://archive.ics.uci.edu/ml/datasets/STUDENT+ALCOHOL+CONSUMPTION
load dplyr, GGally, tidyr and ggplot2
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(GGally)
library(tidyr)
read the data from data folder
alc <- read.table(file = 'data/alc.txt', header = T)
glimpse the data
glimpse(alc)
## Observations: 382
## Variables: 35
## $ school <fctr> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP...
## $ sex <fctr> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F,...
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15,...
## $ address <fctr> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U,...
## $ famsize <fctr> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, ...
## $ Pstatus <fctr> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T,...
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, ...
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, ...
## $ Mjob <fctr> at_home, at_home, at_home, health, other, services...
## $ Fjob <fctr> teacher, other, other, services, other, other, oth...
## $ reason <fctr> course, course, other, home, home, reputation, hom...
## $ nursery <fctr> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, y...
## $ internet <fctr> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes...
## $ guardian <fctr> mother, father, mother, mother, father, mother, mo...
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, ...
## $ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, ...
## $ failures <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ schoolsup <fctr> yes, no, yes, no, no, no, no, yes, no, no, no, no,...
## $ famsup <fctr> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes...
## $ paid <fctr> no, no, yes, yes, yes, yes, no, no, yes, yes, yes,...
## $ activities <fctr> no, no, no, yes, no, yes, no, no, no, yes, no, yes...
## $ higher <fctr> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, ...
## $ romantic <fctr> no, no, no, yes, no, no, no, no, no, no, no, no, n...
## $ famrel <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, ...
## $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, ...
## $ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, ...
## $ Dalc <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Walc <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, ...
## $ health <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, ...
## $ absences <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, ...
## $ G1 <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11,...
## $ G2 <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11...
## $ G3 <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 1...
## $ alc_use <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1...
## $ high_use <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FAL...
The data has 382 observations and 35 variables.
I chose to see if quality of family relationships, family support of education, going out and current health status affect heavy drinking. The rationale behind the first two variables is that it seems plausible that social support makes heavy drinking less probable. Going out with friends offers opportunities to drink, and there might be social pressure to do so. Overall bad health may lead to drinking. Out of the four variables, family support is binary. Family relationships, going out and health are measured on a five-point Likert scale.
Now, let's take a subset of the data that only contains the variables we are interested in and draw bar plots of them.
alc_sub <- alc[,c("famsup","famrel","goout","health", "high_use")]
# use gather() to gather columns into key-value pairs and then glimpse() at the resulting data
gather(alc_sub) %>% glimpse
## Warning: attributes are not identical across measure variables; they will
## be dropped
## Observations: 1,910
## Variables: 2
## $ key <chr> "famsup", "famsup", "famsup", "famsup", "famsup", "famsu...
## $ value <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "ye...
# draw a bar plot of each variable
gather(alc_sub) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
## Warning: attributes are not identical across measure variables; they will
## be dropped
Participants tend to have fairly good family relationships. Most have family support for their studies. Health is skewed towards people feeling very healthy: it is not normally distributed. There are fewer people in the high alcohol usage category. Going out is fairly normally distributed.
Let's explore the chosen variables a bit
# initialize a plot of high_use and family relations
g1 <- ggplot(alc, aes(x = high_use, y = famrel))
# define the plot as a boxplot and draw it
g1 + geom_boxplot() + ylab("Family relationship")
Good family relationships seem to be connected to not being a high user of alcohol, as hypothesized
# initialise a plot of high_use and health
g2 <- ggplot(alc, aes(x = high_use, y = health))
# define the plot as a boxplot and draw it
g2 + geom_boxplot() + ylab("Health")
Maybe a bit surprisingly, health seems to be similar for both high users and low users. This might be due to participants feeling quite healthy overall.
Then, going out and drinking
# initialise a plot of high_use and going out
g2 <- ggplot(alc, aes(x = high_use, y = goout))
# define the plot as a boxplot and draw it
g2 + geom_boxplot() + ylab("Going out")
Those going out with friends more tend to drink more. Not that surprising.
Then, let's see how family support and alcohol usage compare. Let's make a crosstab of them
#create crosstab from high use and famsup
xtabs(~alc$high_use+alc$famsup)
## alc$famsup
## alc$high_use no yes
## FALSE 98 170
## TRUE 46 68
Overall, people not using a lot of alcohol and having family support are the biggest group. Interestingly, the proportion of those without family support is larger in the high-use group (46/114, about 40%) than in the non-high-use group (98/268, about 37%).
We will use logistic regression as the method of choice. Most often this method is used to predict variables that are binary, such as our high_use. Now we will fit a regression model with the variables chosen above and print out a summary of it.
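A minimal sketch of what logistic regression does, on simulated data (the variable names and the true coefficients here are made up for illustration):

```r
# logistic regression models the probability of a binary outcome as the
# logistic transform (inverse logit) of a linear predictor
set.seed(1)
x <- rnorm(200)
p <- plogis(-0.5 + 1.2 * x)            # true model: logit(p) = -0.5 + 1.2x
y <- rbinom(200, size = 1, prob = p)   # binary outcome, like high_use
m <- glm(y ~ x, family = "binomial")
coef(m)  # estimates should land near -0.5 and 1.2
```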
alcm <- glm(high_use ~ famrel + famsup + goout + health, data = alc, family = "binomial")
summary(alcm)
##
## Call:
## glm(formula = high_use ~ famrel + famsup + goout + health, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5527 -0.8023 -0.5594 0.9769 2.4393
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.42198 0.69344 -3.493 0.000478 ***
## famrel -0.40343 0.13648 -2.956 0.003117 **
## famsupyes -0.20366 0.25136 -0.810 0.417821
## goout 0.80717 0.11913 6.776 1.24e-11 ***
## health 0.16986 0.09045 1.878 0.060383 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 404.08 on 377 degrees of freedom
## AIC: 414.08
##
## Number of Fisher Scoring iterations: 4
Okay, so only family relationships and going out have significance. Let's drop everything else, fit the model again with only these as predictors, and print the coefficients.
alcm2 <- glm(high_use ~ famrel + goout, data = alc, family = "binomial")
summary(alcm2)
##
## Call:
## glm(formula = high_use ~ famrel + goout, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5448 -0.7518 -0.5252 0.9872 2.3530
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.0289 0.6148 -3.300 0.000966 ***
## famrel -0.3667 0.1338 -2.740 0.006142 **
## goout 0.7922 0.1183 6.698 2.11e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 408.12 on 379 degrees of freedom
## AIC: 414.12
##
## Number of Fisher Scoring iterations: 4
coef(alcm2)
## (Intercept) famrel goout
## -2.0288785 -0.3666676 0.7921642
As family relationships get poorer, the likelihood of using large amounts of alcohol increases. As a person goes out more, the likelihood of being a high user of alcohol increases.
Next, let's see the confidence intervals for the coefficients as odds ratios.
OR <- coef(alcm2) %>% exp
CI <- exp(confint(alcm2))
## Waiting for profiling to be done...
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.1314829 0.03816207 0.4283774
## famrel 0.6930400 0.53161279 0.8999454
## goout 2.2081701 1.76229526 2.8045361
Going out seems to increase the tendency for high alcohol consumption quite heavily. As the famrel odds ratio is less than one, it is negatively associated with high alcohol consumption: better family relationships imply a tendency to drink less. Neither of the confidence intervals includes 1, which would mean equal odds and thus indicate an unreliable predictor.
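To turn the fitted log-odds back into probabilities, one can push the linear predictor through the inverse logit (plogis in R). A sketch using the rounded coefficients printed above (the function and the example values are illustrative, not exact):

```r
# probability of high_use for given famrel and goout, using the rounded
# coefficients from the model summary above
p_high_use <- function(famrel, goout) {
  plogis(-2.029 - 0.367 * famrel + 0.792 * goout)
}
p_high_use(famrel = 5, goout = 1)  # good relations, rarely out: low probability
p_high_use(famrel = 1, goout = 5)  # poor relations, out a lot: high probability
```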
How well does the model predict? Maybe the following section can answer that.
First, the confusion matrix.
probabilities <- predict(alcm2, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability > 0.5)
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 243 25
## TRUE 68 46
The model looks surprisingly good. Let's plot it.
logmodg <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
logmodg + geom_point()
Okay, how good is the model then, actually? A loss function might tell us. First, though, let's see a table of the prediction results
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.63612565 0.06544503 0.70157068
## TRUE 0.17801047 0.12041885 0.29842932
## Sum 0.81413613 0.18586387 1.00000000
So the model predicted FALSE when the truth was FALSE 64% of the time, and TRUE when the truth was TRUE 12% of the time.
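For the record, the usual classification rates can be computed from the confusion matrix above (a small sketch; the counts are copied from the table):

```r
# classification rates from the confusion matrix counts above:
# TN = 243, FP = 25, FN = 68, TP = 46
TN <- 243; FP <- 25; FN <- 68; TP <- 46
accuracy    <- (TN + TP) / (TN + FP + FN + TP)  # about 0.76
sensitivity <- TP / (TP + FN)                   # about 0.40: TRUEs are hard
specificity <- TN / (TN + FP)                   # about 0.91: FALSEs are easy
c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```

The model is much better at spotting non-high users than high users, which the plot above also suggests.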
# the logistic regression model m and dataset alc with predictions are available
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2434555
So overall, the model is wrong in 24% of its guesses. That is actually not bad.
How would this model compare to one where every prediction is the most common category, FALSE? Let's find out by creating a new variable containing only FALSE values and using the loss function on it.
alc$all_false <- FALSE
loss_func(class = alc$high_use, prob = alc$all_false)
## [1] 0.2984293
It would seem that the model is better than just guessing FALSE all the time, though one can get pretty good results with that guess too.
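As a sanity check, the loss of the always-FALSE classifier can be computed directly from the class proportions (counts taken from the tables above):

```r
# guessing FALSE for everyone is wrong exactly for the TRUE cases:
# 114 high users out of 382 students
114 / 382  # about 0.298, matching the loss function result above
```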
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = alcm2, K = 10)
cv$delta[1]
## [1] 0.2617801
Using 10-fold cross-validation, my model performs fairly similarly to the DataCamp model, which misclassified around 26%. It might be that going out is just too good a predictor: for the student population, going out with friends may simply mean drinking with friends, so it is not the most interesting predictor.
This week we are going to do clustering.
Load MASS, tidyr, dplyr, corrplot, reshape2 and ggplot2 packages
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(tidyr)
library(ggplot2)
library(dplyr)
library(corrplot)
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
Load the Boston data, print out structure, summary, matrix plot and correlation plot of the variables.
# load the data
data("Boston")
# explore the dataset
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
# plot matrix of the variables
d <- melt(Boston)
## No id variables; using all as measure variables
ggplot(d,aes(x = value)) +
facet_wrap(~variable,scales = "free") +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
pairs(Boston)
# calculate the correlation matrix and round it
cor_matrix <- cor(Boston) %>% round(digits = 2)
# print the correlation matrix
corrplot(cor_matrix, type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
# visualize the correlation matrix
corrplot(cor_matrix, method="circle")
The Boston data has 506 rows and 14 columns. The data is about housing values in Boston and contains variables that may explain some of the variance in the housing values, such as distances to five Boston employment centres and pupil-teacher ratios by town. Many variables have somewhat odd distributions: huge spikes and otherwise fairly low counts.
As we will be dealing with the crime rate variable, let's discuss it a bit. The crime rate is clearly very high in some areas; crime is not distributed evenly across the city.
It is not surprising that business areas have higher levels of pollution. The crime rate correlates negatively with distance to the employment centres and positively with accessibility to radial highways. Also, richer areas, where the valuable homes are, seem to have less crime.
Explanations of variables, as they are not named intuitively:
crim = per capita crime rate by town
zn = proportion of residential land zoned for lots over 25,000 sq.ft.
indus = proportion of non-retail business acres per town
chas = Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox = nitrogen oxides concentration (parts per 10 million)
rm = average number of rooms per dwelling
age = proportion of owner-occupied units built prior to 1940
dis = weighted mean of distances to five Boston employment centres
rad = index of accessibility to radial highways
tax = full-value property-tax rate per $10,000
ptratio = pupil-teacher ratio by town
black = 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
lstat = lower status of the population (percent)
medv = median value of owner-occupied homes in $1000s
As we are going to use LDA, which assumes normally distributed variables with equal variances, we standardize the data so that each variable has mean 0 and unit variance.
# center and standardize variables
boston_scaled <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix"
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
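What scale() actually does (subtract the column mean, divide by the column standard deviation) can be checked by hand on a toy vector:

```r
# standardization by hand: (x - mean) / sd, which is exactly what scale() does
x <- c(2, 4, 6, 8)
by_hand <- (x - mean(x)) / sd(x)
all.equal(as.vector(scale(x)), by_hand)  # TRUE
```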
Create a categorical variable out of crim and drop the original. LDA predicts a categorical target variable, so we will turn the crime rate into one. At this point, I will create a copy of the standardized data with the original crim for the k-means clustering exercise and name it b_scaled2. This way there is no need to reload and rescale the data later on.
# save the scaled crim as scaled_crim and create copy of scaled data
scaled_crim <- boston_scaled$crim
b_scaled2 <- boston_scaled
# summary of the scaled_crim
summary(scaled_crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419400 -0.410600 -0.390300 0.000000 0.007389 9.924000
# create a quantile vector of crim and print it
bins <- quantile(scaled_crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# create a categorical variable 'crime'
crime <- cut(scaled_crim, breaks = bins, include.lowest = TRUE, label=c("low","med_low","med_high","high"))
# look at the table of the new factor crime
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
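How cut() with quantile breaks produces (roughly) equal-sized classes can be seen on a toy vector:

```r
# quantile breaks split a variable into four classes of about equal size
v <- 1:100
q <- quantile(v)
cls <- cut(v, breaks = q, include.lowest = TRUE,
           labels = c("low", "med_low", "med_high", "high"))
table(cls)  # 25 observations in each class
```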
Create training and test sets from the scaled data. We will create the model with the training set and later evaluate it with the test set. The crime variable is removed from the test set.
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
Fit the linear discriminant analysis on the train set. Here I use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. After this, the results are plotted.
# linear discriminant analysis
lda.fit <- lda(crime ~., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2500000 0.2524752 0.2500000 0.2475248
##
## Group means:
## zn indus chas nox rm
## low 1.01940295 -0.9138673 -0.07742312 -0.8884925 0.45594634
## med_low -0.09083686 -0.2850728 -0.07933396 -0.5744729 -0.13823258
## med_high -0.37516530 0.1872654 0.15646403 0.3927664 0.06885046
## high -0.48724019 1.0171519 -0.03610305 1.0473539 -0.36008855
## age dis rad tax ptratio
## low -0.8673706 0.9018001 -0.7066939 -0.7366214 -0.4431954
## med_low -0.3497999 0.3758294 -0.5382477 -0.4804428 -0.1230137
## med_high 0.4699678 -0.3601570 -0.4451619 -0.3353818 -0.2968492
## high 0.7985438 -0.8446323 1.6377820 1.5138081 0.7803736
## black lstat medv
## low 0.37795309 -0.76314941 0.52059710
## med_low 0.31799448 -0.11020558 -0.00548579
## med_high 0.08986796 0.04686412 0.17632171
## high -0.86973711 0.82762720 -0.67573513
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.08836307 0.764886432 -1.028377738
## indus 0.01193862 -0.204711947 0.361078225
## chas -0.02792670 0.009694380 -0.008519928
## nox 0.38251680 -0.691868835 -1.354136435
## rm 0.01551848 -0.007839186 -0.157240348
## age 0.18486839 -0.360002608 -0.264160777
## dis -0.10817326 -0.336677179 0.173941126
## rad 3.33730890 0.911247133 0.140000558
## tax 0.04462571 -0.047383483 0.446678345
## ptratio 0.13911550 0.021466845 -0.473024578
## black -0.11410749 0.029482365 0.153531795
## lstat 0.21839601 -0.255792454 0.341239812
## medv 0.11178389 -0.469169132 -0.230502339
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9502 0.0362 0.0137
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 1)
The plot shows that the index of accessibility to radial highways has the greatest impact on the classification of high crime rate areas.
Now I will predict the classes with the LDA model on the test data and cross-tabulate the results with the crime categories from the test set. The correct categories were saved earlier to correct_classes.
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 12 13 1 0
## med_low 4 13 7 0
## med_high 0 5 18 2
## high 0 0 0 27
The model is fairly accurate in its predictions. Noteworthy is that it made no mistakes with the high crime area predictions. It is more confused by the low and med_low categories.
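The overall accuracy can be computed from the diagonal of the cross-tabulation (a quick sketch; the counts are copied from my table above and will vary between runs, since the train/test split is random):

```r
# correct predictions sit on the diagonal of the cross-tabulation
correct <- 12 + 13 + 18 + 27   # diagonal counts from the table above
total   <- 26 + 24 + 25 + 27   # row sums, i.e. the test set size
correct / total                # about 0.69, far better than the 0.25 of chance
```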
Now I will run k-means clustering on the scaled b_scaled2 data that we saved earlier. I will calculate the distances between the observations and run the k-means algorithm on the dataset. After this, I will investigate the optimal number of clusters and run the algorithm again.
# euclidean distance matrix
dist_eu <- dist(b_scaled2)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4620 4.8240 4.9110 6.1860 14.4000
# manhattan distance matrix
dist_man <- dist(b_scaled2, method="manhattan")
# look at the summary of the distances
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4830 12.6100 13.5500 17.7600 48.8600
# k-means clustering
km <-kmeans(dist_eu, centers = 15)
# plot the scaled Boston dataset with clusters
pairs(b_scaled2, col = km$cluster)
#Optimal cluster amount
set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(dist_eu, k)$tot.withinss})
# visualize the results
plot(1:k_max, twcss, type='b')
With 15 clusters, it is quite hard to make any sense of the pairs plot.
After calculating the total within-cluster sum of squares and plotting it, the sharpest drop is between 1 and 2, so 2 is probably the optimal number of clusters.
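The same elbow heuristic can be illustrated on simulated data with two clear clusters (a toy sketch; the data is made up), where the total within-cluster sum of squares should drop sharpest from one to two clusters:

```r
# two well-separated 2D clusters around (0, 0) and (5, 5)
set.seed(123)
xy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
            matrix(rnorm(100, mean = 5), ncol = 2))
# total within-cluster sum of squares for k = 1..5
twcss <- sapply(1:5, function(k) kmeans(xy, k)$tot.withinss)
which.max(-diff(twcss))  # the largest drop is at the first step (k = 1 -> 2)
```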
# k-means clustering
km <-kmeans(dist_eu, centers = 2)
# plot the scaled Boston dataset with clusters
pairs(b_scaled2, col = km$cluster)
The index of accessibility to radial highways seems to set these clusters apart. Regarding the crime rate, one cluster contains only low crime rates, the other both high and low. The proportion of non-retail business acres also seems to be a clear divider. It would be interesting to see these clusters plotted on an actual map.
Load GGally, tidyr, dplyr, corrplot, reshape2 and ggplot2 packages
library(GGally)
library(tidyr)
library(ggplot2)
library(dplyr)
library(corrplot)
library(reshape2)
Load the data to R and inspect it a bit
human <- read.table(file = 'data/human.txt', header = T)
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
dim(human)
## [1] 155 8
The data has 155 observations of 8 variables. Each observation in this data represents a different country. Edu2.FM and Labo.FM are the ratios of female to male proportions in secondary education and in the labour force. Edu.Exp and Life.Exp are the expected length of education and life expectancy. GNI is short for Gross National Income. Mat.Mor is the maternal mortality ratio, Ado.Birth the adolescent birth rate and Parli.F the percentage of parliamentary seats held by women.
Graphical overview and summaries of the variables
# visualize the 'human' variables
ggpairs(human)
# compute the correlation matrix and visualize it with corrplot
cor(human) %>% corrplot()
summary(human)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
Life expectancy and expected years of schooling correlate fairly strongly with each other and with GNI. A larger proportion of educated females relative to males also correlates with these variables. Interestingly, the same is not true for the female/male labour-force ratio. I have worked with similar data before, and there a variable measuring females in non-agricultural jobs correlated with GNI; I suspect that females in agricultural jobs are the culprit here. Both ratios correlate negatively with maternal mortality. GNI, maternal mortality and adolescent birth rate all have right-skewed distributions.
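A quick sanity check on the skew direction: for GNI, Mat.Mor and Ado.Birth the mean sits well above the median in the summaries, which is the signature of a long right tail. Simulated log-normal data (a hypothetical income-like variable, not the actual GNI) shows the same pattern:

```r
# a right-skewed variable has mean > median; log-normal data
# serves as a hypothetical illustration of this
set.seed(4)
income <- rlnorm(1000, meanlog = 9, sdlog = 1)  # long right tail
c(mean = mean(income), median = median(income))
```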
Next, we will do principal component analysis (PCA) with non-scaled data. PCA transforms the data into a smaller set of variables (principal components) that do not correlate with each other. It is used to bring out patterns in the data.
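As a small illustration on toy data (not the human dataset), the component scores returned by prcomp() are uncorrelated by construction:

```r
# toy data: two strongly correlated variables
set.seed(1)
x <- rnorm(100)
toy <- data.frame(a = x, b = x + rnorm(100, sd = 0.3))

pca <- prcomp(toy)

# the principal component scores (pca$x) are uncorrelated
round(cor(pca$x)[1, 2], 10)
```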
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human)
#see the summary of the model
s <- summary(pca_human)
s
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000
## PC8
## Standard deviation 0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion 1.0000
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.8, 1) ,col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
Figure 1. A mess due to non-scaled data
With the unscaled data, the first component swallows practically all of the variance. The plot shows that this is most likely due to GNI: its scale is vastly different from the other variables, so the variance the other variables would contribute to the model is negligible by comparison.
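This scale-dominance effect is easy to reproduce on hypothetical toy data with one variable on a GNI-like scale:

```r
# one variable with a vastly larger scale, mimicking GNI
set.seed(2)
toy <- data.frame(small = rnorm(50),             # sd around 1
                  big   = rnorm(50, sd = 1000))  # sd around 1000

unscaled <- prcomp(toy)
scaled   <- prcomp(toy, scale. = TRUE)

# PC1 of the unscaled model explains essentially all of the variance,
# while scaling spreads the variance over the components
summary(unscaled)$importance["Proportion of Variance", "PC1"]
summary(scaled)$importance["Proportion of Variance", "PC1"]
```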
Let's do the same with standardized data.
# standardize the variables
human_std <- scale(human)
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human_std)
# print out summaries of the standardized variables and the model
summary(human_std)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :-2.8189 Min. :-2.6247 Min. :-2.7378 Min. :-2.7188
## 1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6782 1st Qu.:-0.6425
## Median : 0.3503 Median : 0.2316 Median : 0.1140 Median : 0.3056
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.7126 3rd Qu.: 0.6717
## Max. : 2.6646 Max. : 1.6632 Max. : 2.4730 Max. : 1.4218
## GNI Mat.Mor Ado.Birth Parli.F
## Min. :-0.9193 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.7243 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median :-0.3013 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3712 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 5.6890 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
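The zero means in the summary above are exactly what scale() guarantees: each column is centered to mean 0 and scaled to standard deviation 1. A minimal check on toy data:

```r
# scale() subtracts each column mean and divides by the column
# standard deviation, giving every column mean 0 and sd 1
m <- matrix(c(1, 2, 3, 10, 20, 30), ncol = 2)
m_std <- scale(m)

colMeans(m_std)       # both effectively 0
apply(m_std, 2, sd)   # both exactly 1
```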
s <- summary(pca_human)
s
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
## PC7 PC8
## Standard deviation 0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion 0.98702 1.00000
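The "Proportion of Variance" row in these summaries can be reproduced directly from the component standard deviations: each component's variance (sdev squared) divided by the total variance. A sketch on toy data:

```r
# proportion of variance explained by each PC equals
# its variance (sdev^2) divided by the total variance
set.seed(3)
toy <- data.frame(matrix(rnorm(200), ncol = 4))
pca <- prcomp(toy)

prop_var <- pca$sdev^2 / sum(pca$sdev^2)
prop_var
```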
pca_human
## Standard deviations:
## [1] 2.0708380 1.1397204 0.8750485 0.7788630 0.6619563 0.5363061 0.4589994
## [8] 0.3222406
##
## Rotation:
## PC1 PC2 PC3 PC4 PC5
## Edu2.FM -0.35664370 0.03796058 -0.24223089 0.62678110 -0.5983585
## Labo.FM 0.05457785 0.72432726 -0.58428770 0.06199424 0.2625067
## Edu.Exp -0.42766720 0.13940571 -0.07340270 -0.07020294 0.1659678
## Life.Exp -0.44372240 -0.02530473 0.10991305 -0.05834819 0.1628935
## GNI -0.35048295 0.05060876 -0.20168779 -0.72727675 -0.4950306
## Mat.Mor 0.43697098 0.14508727 -0.12522539 -0.25170614 -0.1800657
## Ado.Birth 0.41126010 0.07708468 0.01968243 0.04986763 -0.4672068
## Parli.F -0.08438558 0.65136866 0.72506309 0.01396293 -0.1523699
## PC6 PC7 PC8
## Edu2.FM 0.17713316 0.05773644 0.16459453
## Labo.FM -0.03500707 -0.22729927 -0.07304568
## Edu.Exp -0.38606919 0.77962966 -0.05415984
## Life.Exp -0.42242796 -0.43406432 0.62737008
## GNI 0.11120305 -0.13711838 -0.16961173
## Mat.Mor 0.17370039 0.35380306 0.72193946
## Ado.Birth -0.76056557 -0.06897064 -0.14335186
## Parli.F 0.13749772 0.00568387 -0.02306476
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.8, 1) ,col = c("grey40", "deeppink2"))
Figure 2. National wellbeing and female participation dimensions
This is much better now: all the variables are on the same scale. The first dimension still explains most of the variance (around 54 percent), but not all of it. The female/male ratio of the workforce loads heavily on component 2, as does the percentage of females in parliament, and the biplot confirms this. However, Parli.F pulls slightly towards the countries with high GNI and other quality-of-life measures such as life expectancy, whilst the labour ratio pulls slightly towards qualities such as high maternal mortality.
Now we will do Multiple Correspondence Analysis (MCA), which can be used to detect patterns or structure in the data and to reduce its dimensionality. We will use data containing answers to a questionnaire on tea consumption.
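One way to think about MCA is as correspondence analysis applied to the indicator (0/1 dummy) matrix of the factor variables. A sketch of that underlying matrix, built with model.matrix() on a couple of hypothetical factors:

```r
# two hypothetical factor variables, similar in spirit to the tea data
d <- data.frame(drink = factor(c("tea", "coffee", "tea")),
                place = factor(c("home", "work", "home")))

# one 0/1 column per factor level (keep all levels, drop the intercept)
ind <- model.matrix(~ . - 1, data = d,
                    contrasts.arg = lapply(d, contrasts, contrasts = FALSE))
ind
```

Each row of the indicator matrix sums to the number of variables, since every observation belongs to exactly one level of each factor.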
library(FactoMineR)
#load the tea data
data("tea")
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
I want to simplify the dataset a bit, so I will take the first 12 variables. These variables hold when- and where-type data, with one variable indicating whether tea is drunk with friends. It will be interesting to see if this spatio-temporal mixture of data, with a hint of social factors, leads to some revelations.
library(reshape2)
tea_sub <- tea[,1:12]
d <- gather(tea_sub)
## Warning: attributes are not identical across measure variables; they will
## be dropped
ggplot(d,aes(value)) +
facet_wrap("key",scales = "free") +
geom_bar()
Most people do not enjoy tea during dinner, and tend to drink it at home. Drinking tea at breakfast divides the data fairly evenly. Lunch is not tea time for many. Tea is often enjoyed with friends, but not in a pub.
Let’s create the model.
#create mca model
mca <- MCA(tea_sub, graph = FALSE)
# print statistics of the model
summary(mca)
##
## Call:
## MCA(X = tea_sub, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.170 0.110 0.099 0.090 0.088 0.080
## % of var. 17.038 11.006 9.935 9.027 8.833 7.963
## Cumulative % of var. 17.038 28.044 37.979 47.006 55.839 63.801
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11 Dim.12
## Variance 0.075 0.066 0.062 0.059 0.056 0.043
## % of var. 7.482 6.629 6.204 5.934 5.624 4.325
## Cumulative % of var. 71.284 77.913 84.117 90.051 95.675 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.571 0.638 0.571 | -0.346 0.363 0.210 | -0.032
## 2 | -0.571 0.638 0.571 | -0.346 0.363 0.210 | -0.032
## 3 | 0.134 0.035 0.009 | 0.641 1.243 0.206 | -0.220
## 4 | -0.900 1.583 0.488 | 0.394 0.470 0.094 | 0.046
## 5 | -0.290 0.164 0.104 | -0.040 0.005 0.002 | 0.179
## 6 | -0.900 1.583 0.488 | 0.394 0.470 0.094 | 0.046
## 7 | -0.122 0.029 0.036 | -0.389 0.458 0.364 | -0.202
## 8 | -0.265 0.137 0.111 | 0.060 0.011 0.006 | -0.302
## 9 | -0.358 0.251 0.243 | -0.527 0.840 0.525 | -0.174
## 10 | -0.130 0.033 0.017 | -0.129 0.051 0.016 | -0.139
## ctr cos2
## 1 0.003 0.002 |
## 2 0.003 0.002 |
## 3 0.162 0.024 |
## 4 0.007 0.001 |
## 5 0.107 0.040 |
## 6 0.007 0.001 |
## 7 0.137 0.098 |
## 8 0.306 0.145 |
## 9 0.101 0.057 |
## 10 0.065 0.019 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## breakfast | 0.148 0.515 0.020 2.460 | -0.752 20.547 0.522
## Not.breakfast | -0.137 0.475 0.020 -2.460 | 0.694 18.966 0.522
## Not.tea time | -0.594 7.525 0.273 -9.037 | 0.405 5.420 0.127
## tea time | 0.460 5.833 0.273 9.037 | -0.314 4.201 0.127
## evening | 0.491 4.053 0.126 6.142 | 0.584 8.876 0.179
## Not.evening | -0.257 2.119 0.126 -6.142 | -0.306 4.641 0.179
## lunch | 0.858 5.287 0.127 6.154 | 0.085 0.081 0.001
## Not.lunch | -0.148 0.909 0.127 -6.154 | -0.015 0.014 0.001
## dinner | -1.250 5.346 0.118 -5.928 | 1.395 10.318 0.147
## Not.dinner | 0.094 0.402 0.118 5.928 | -0.105 0.777 0.147
## v.test Dim.3 ctr cos2 v.test
## breakfast -12.491 | -0.100 0.406 0.009 -1.668 |
## Not.breakfast 12.491 | 0.093 0.375 0.009 1.668 |
## Not.tea time 6.164 | 0.302 3.351 0.071 4.605 |
## tea time -6.164 | -0.234 2.597 0.071 -4.605 |
## evening 7.306 | -0.445 5.716 0.104 -5.570 |
## Not.evening -7.306 | 0.233 2.988 0.104 5.570 |
## lunch 0.612 | -1.226 18.486 0.258 -8.788 |
## Not.lunch -0.612 | 0.211 3.177 0.258 8.788 |
## dinner 6.619 | 0.095 0.053 0.001 0.450 |
## Not.dinner -6.619 | -0.007 0.004 0.001 -0.450 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## breakfast | 0.020 0.522 0.009 |
## tea.time | 0.273 0.127 0.071 |
## evening | 0.126 0.179 0.104 |
## lunch | 0.127 0.001 0.258 |
## dinner | 0.118 0.147 0.001 |
## always | 0.094 0.024 0.490 |
## home | 0.001 0.154 0.010 |
## work | 0.181 0.025 0.082 |
## tearoom | 0.322 0.000 0.012 |
## friends | 0.311 0.068 0.003 |
According to the variables shown in the summary, dimensions 1 and 2 are separated most by drinking tea at dinner time. Drinking tea at dinner loads highly on dimension 2, as does not drinking tea at breakfast. Plotting shows that whether tea is drunk at home is also important for this dimension. Dimension 1 is more cluttered: drinking tea in tearooms and with friends loads more on this component, judging by the eta2 values. The third dimension seems to be related to drinking tea "always".
plot(mca, invisible=c("ind"), habillage = "quali")
Inspired by Ville Harjunen (https://villehar.github.io/IODS-project/), I also wanted to try the plot he used in his project. I removed the arrows from the plot for clarity; here the individuals are also shown. It seems that there are people who prefer drinking tea only at dinner, and seldom at home. The dimension may also separate people who seldom drink tea from those who drink it everywhere, all the time, culminating in people who actually visit specialized tea houses.
library("devtools")
library("factoextra")
fviz_mca_biplot(mca, axes = c(1, 2), geom = c("point", "text"),
label = "all")